Take-home Exercise 02
Detecting illegal, unreported, and unregulated (IUU) fishing
FishEye International has provided import/export data for the years 2028 to 2034 on Oceanus’ fishing industry. This exercise uses that data to detect entities involved in IUU fishing and, in particular, to answer the following questions:
Use visual analytics to identify temporal patterns for individual entities and between entities in the knowledge graph FishEye created from trade records. Categorize the types of business relationship patterns you find.
Identify companies that fit a pattern of illegal fishing. Use visualizations to support your conclusions and your confidence in them.
Import data
The following code chunk is used to import the data. Since the provided data is in .json format, the fromJSON() function is used:
# Packages assumed from the functions used in this exercise
library(jsonlite)   # fromJSON()
library(tidyverse)  # as_tibble(), select(), mutate(), ...
library(lubridate)  # ymd(), year()
library(tidygraph)  # tbl_graph(), centrality_*()
library(ggraph)     # ggraph(), geom_edge_link()
library(ggiraph)    # geom_point_interactive(), girafe()
json_file_path <- "data/mc2_challenge_graph.json"
mc2_file_path <- "data/mc2.rds"
# Parse the .json once and cache it as .rds; later runs load the cache
if (!file.exists(mc2_file_path)) {
  mc2 <- fromJSON(json_file_path)
  saveRDS(mc2, mc2_file_path)
} else {
  mc2 <- readRDS(mc2_file_path)
}
Importing data from the .json file takes time. Hence, an if-else statement is used so that the .json file only has to be parsed once, after which the result is saved as a .rds file. If the .rds file already exists, it is loaded directly with no need to re-parse the .json file.
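The same parse-once-then-cache idea can be wrapped in a small reusable function. The sketch below is only an illustration: load_cached() and the stand-in slow_loader() are hypothetical names that are not part of the original exercise, and a temporary file stands in for data/mc2.rds.

```r
# Hypothetical helper (not in the original code): return the cached object
# if the .rds file exists, otherwise run the loader once and cache its result.
load_cached <- function(cache_path, loader) {
  if (file.exists(cache_path)) {
    return(readRDS(cache_path))  # fast path: reuse the cached object
  }
  obj <- loader()                # slow path: run the expensive import
  saveRDS(obj, cache_path)
  obj
}

# Stand-in loader that counts how often it runs. In the exercise, the loader
# would be something like function() jsonlite::fromJSON(json_file_path).
cache_file <- tempfile(fileext = ".rds")
calls <- 0
slow_loader <- function() {
  calls <<- calls + 1
  list(nodes = data.frame(id = c("a", "b")))
}

first  <- load_cached(cache_file, slow_loader)  # runs the loader, writes cache
second <- load_cached(cache_file, slow_loader)  # reads the cache instead
```

After both calls, the loader has run only once and the two results hold the same data. Deleting the .rds file forces a fresh import, which is worth remembering if the source .json ever changes.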
Data wrangling
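Now that the data has been imported, the node and link tables can be converted to tibble dataframes. The select() function keeps only the relevant columns and re-orders them at the same time. For the edges, ymd() and year() parse arrivaldate into a date and derive the Year, and distinct() drops duplicate records.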
mc2_nodes <- as_tibble(mc2$nodes) %>%
  select(id, shpcountry, rcvcountry)
mc2_edges <- as_tibble(mc2$links) %>%
  mutate(ArrivalDate = ymd(arrivaldate)) %>%
  mutate(Year = year(ArrivalDate)) %>%
  select(
    source,
    target,
    ArrivalDate,
    Year,
    hscode,
    valueofgoods_omu,
    volumeteu,
    weightkg,
    valueofgoodsusd
  ) %>%
  distinct()
Every edge specifies a trade between a source and a target. The following code yields a dataframe, mc2_nodes_extracted, of the unique entities that appear at either end of any edge (as the source or the target of a trading relationship). Since some entity names are rather long, a cid is also generated as an auto-incremented company ID, so that the companies can be referred to more easily.
mc2_nodes_extracted <- union(unique(mc2_edges$source),
                             unique(mc2_edges$target)) %>%
  sort() %>%
  as_tibble()
colnames(mc2_nodes_extracted) <- "name"
mc2_nodes_extracted <- mc2_nodes_extracted %>%
  mutate(cid = row_number())
Since every edge represents one transaction, grouping the edges helps to reduce the number of edges in the network.
A grp_hscode variable is generated from the first digit of the hscode column. HS codes are 6-digit numbers specifying the exact type of good being traded. As there are many types of goods, plotting them all by hscode would be too messy. By extracting only the first digit of each HS code, some insight may still be gleaned from these broader categories (there will only be 9 categories, 1 for each digit).
Thereafter, the edges are grouped by source and target companies, grp_hscode and Year. The number of trades (num_trades) and total weight in kg (total_weightkg) are summarised for each group. Edges from a company to itself are dropped, and only trading relationships between two companies with more than 20 trades per year within an HS code group are included in the network. This filters out the low-frequency traders that are less likely to have a substantial impact on the industry.
mc2_edges$grp_hscode <- substr(mc2_edges$hscode, 1, 1)
mc2_edges_agg <- mc2_edges %>%
  group_by(source, target, grp_hscode, Year) %>%
  summarise(num_trades = n(),
            total_weightkg = sum(weightkg)) %>%
  filter(source != target) %>%
  filter(num_trades > 20) %>%
  ungroup()
Though the data dictionary states that more information can be gleaned by merging with the hscode table, no such table is included in the downloads. As such, we do not have descriptions of the types of goods being traded, and will have to assume for this exercise that all goods in this dataset are fish/marine life-related.
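To make the grouping and the 20-trade threshold concrete, here is a minimal sketch on a toy edge list. The companies, HS codes and weights are made up purely for illustration. The pipeline mirrors the aggregation above, except that .groups = "drop" is used in place of the trailing ungroup().

```r
library(dplyr)

# Toy edge list: made-up companies and HS codes, for illustration only
toy_edges <- tibble(
  source   = c(rep("A", 25), rep("A", 3), rep("B", 22), "C"),
  target   = c(rep("B", 25), rep("B", 3), rep("C", 22), "C"),
  hscode   = c(rep("306170", 25), rep("160414", 3), rep("304620", 22), "306170"),
  Year     = 2028,
  weightkg = 100
)

toy_agg <- toy_edges %>%
  mutate(grp_hscode = substr(hscode, 1, 1)) %>%   # broad HS group = first digit
  group_by(source, target, grp_hscode, Year) %>%
  summarise(num_trades = n(),
            total_weightkg = sum(weightkg),
            .groups = "drop") %>%
  filter(source != target,                        # drop self-loops (C to C)
         num_trades > 20)                         # keep frequent relationships

toy_agg
```

Only two aggregated edges survive: A to B in group 3 (25 trades) and B to C in group 3 (22 trades). The three A to B trades in group 1 fall below the threshold even though A and B trade 28 times in total that year, which shows that the cut-off applies per HS code group rather than per company pair.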
Plotting the network graph
Now, a tbl_graph object will be created for the purpose of plotting the network graph. At the same time, the betweenness centrality and out-degree centrality of each company (represented as a node) will be computed across all years of trade. Both measures are weighted by the number of trades.
These measures are used to detect potential IUU fishing:
Companies with positive betweenness centrality would most likely be acting as intermediaries or distributors. Such companies are important links in the industry network. Hence, they are less likely to be the ones involved in IUU fishing. Conversely, a company with zero betweenness centrality is unlikely to serve as an important link in the industry network.
Companies with positive out-degree centrality, in addition to having zero betweenness centrality, would likely be fishing companies that sell what they catch. However, legitimate fishing companies that are not trying to keep a low profile would in all likelihood also engage in some sort of buying (e.g., for bait, ship components, or for other business activities). A company that both buys and sells lies on a path between its suppliers and its customers, and thus would not have zero betweenness centrality.
Therefore, companies with zero betweenness centrality and positive out-degree centrality can be considered suspicious.
mc2_graph <- tbl_graph(nodes = mc2_nodes_extracted,
                       edges = mc2_edges_agg,
                       directed = TRUE) %>%
  activate(nodes) %>%
  mutate(betweenness_centrality = centrality_betweenness(weights = num_trades)) %>%
  mutate(outdegree_centrality = centrality_degree(weights = num_trades,
                                                  mode = "out"))
The following code chunk creates the ggraph object for the year 2028. The for-loop is designed for scalability (adding years into the years vector will create the respective years’ ggraph objects separately).
The following elements are part of the design of the network:
Betweenness and out-degree centrality are re-generated for each node, as the data is now filtered by year.
The filter(!node_is_isolated()) call helps to remove nodes that have no corresponding edges for the year.
Edge widths represent the number of trades.
Edge colours represent the type of goods being traded, based on grp_hscode.
Node sizes represent the betweenness centrality.
Node fills represent whether the out-degree centrality is zero or non-zero.
Interactive elements will be explained later on.
years <- c("2028")
for (y in years) {
  mygraph <- paste("mc2", "graph", y, sep = "_")
  # Keep this year's edges, drop isolated nodes, recompute the centralities
  assign(
    mygraph,
    mc2_graph %>%
      activate(edges) %>%
      filter(Year == y) %>%
      activate(nodes) %>%
      filter(!node_is_isolated()) %>%
      mutate(betweenness_centrality = centrality_betweenness(weights = num_trades)) %>%
      mutate(outdegree_centrality = centrality_degree(weights = num_trades,
                                                      mode = "out"))
  )
  # Build the ggraph object and store it as g_<year>
  assign(
    paste("g", y, sep = "_"),
    ggraph(get(mygraph),
           layout = "nicely") +
      geom_edge_link(aes(width = num_trades,
                         color = grp_hscode),
                     alpha = 0.6) +
      # Edge width maps num_trades, so the legend is labelled accordingly
      scale_edge_width(range = c(0.4, 4), name = "Number of trades") +
      scale_edge_color_brewer(name = "HS code group",
                              palette = "Set1") +
      geom_point_interactive(
        aes(
          x = x,
          y = y,
          tooltip = paste0(
            "Name: ", name,
            "\nCompany ID: ", cid,
            "\nOut-degree: ", outdegree_centrality,
            "\nBetweenness: ", betweenness_centrality
          ),
          data_id = outdegree_centrality > 0,
          size = betweenness_centrality,
          fill = outdegree_centrality > 0
        ),
        colour = "grey20",
        shape = 21,
        alpha = 0.8
      ) +
      scale_fill_manual(labels = c("Zero", "Non-zero"),
                        values = c("cyan", "firebrick1"),
                        name = "Out-degree") +
      scale_size_continuous(range = c(1, 10), name = "Betweenness") +
      theme_graph(foreground = "grey20") +
      labs(title = y) +
      theme(plot.title = element_text(size = 11))
  )
}
rm(y, years, mygraph)
Network graph for year 2028
To give a sense of the scale of the network, only data from the year 2028 is plotted, using the following code chunk.
girafe(
  ggobj = g_2028,
  options = list(
    opts_hover(css = "fill:;"),
    opts_hover_inv(css = "opacity: 0.2;"),
    opts_selection(type = "multiple", only_shiny = FALSE,
                   css = "opacity:1;"),
    opts_selection_inv(css = "opacity:0;")
  )
)
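The screening rule described earlier (zero betweenness centrality together with positive out-degree centrality) can also be applied as a plain table filter alongside the plot. The sketch below uses a toy network with hypothetical companies A to D, and toy_flags and suspicious are illustrative names only. It leaves both measures unweighted for simplicity. One point to keep in mind when weighting by num_trades, as the centrality code in this exercise does, is that igraph interprets betweenness weights as path lengths, so heavily traded links count as longer rather than stronger.

```r
library(dplyr)
library(tidygraph)

# Toy network (hypothetical companies): A and D only sell, B buys and
# resells, C only buys.
toy_nodes <- tibble(name = c("A", "B", "C", "D"))
toy_edges <- tibble(from       = c("A", "B", "D"),
                    to         = c("B", "C", "B"),
                    num_trades = c(30, 25, 40))

toy_flags <- tbl_graph(nodes = toy_nodes, edges = toy_edges, directed = TRUE) %>%
  activate(nodes) %>%
  mutate(betweenness = centrality_betweenness(),       # unweighted
         outdegree   = centrality_degree(mode = "out"),
         # Screening rule: sells to others but never sits between two traders
         suspicious  = betweenness == 0 & outdegree > 0) %>%
  as_tibble()

filter(toy_flags, suspicious)
```

In this toy network, A and D are flagged: each sells to B but lies on no path between other companies. B is not flagged because it sits between its suppliers and C, and C is not flagged because it sells nothing. Applied to a yearly graph such as the one built for 2028, the same filter would list the cyan-versus-red and node-size information from the plot in tabular form.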